Abstract


The National Combustion Code (NCC) is
being developed by an industry-government
team for the design and analysis of combus-
tion systems. CORSAIR-CCD is the current
baseline reacting flow solver for NCC. This is
a parallel, unstructured grid code which uses
a distributed memory, message passing model
for its parallel implementation. The focus of
the present effort has been to improve the
performance of the NCC flow solver to meet
combustor designer requirements for model
accuracy and analysis turnaround time. Im-
proving the performance of this code con-
tributes significantly to the overall reduction
in time and cost of the combustor design cy-
cle. This paper describes the parallel imple-
mentation of the NCC flow solver and sum-
marizes its current parallel performance on
an SGI Origin 2000. Earlier parallel per-
formance results on an IBM SP-2 are also
included. The performance improvements
which have enabled a turnaround of less than
15 hours for a 1.3 million element fully react-
ing combustion simulation are described.


*Member AIAA.

†Aerospace engineer, Senior member AIAA.


Introduction

The National Combustion Code (NCC) is an 
integrated system of computer codes being 
developed by an industry-government team 
for the design and analysis of combustion sys- 
tems. The objective of this effort is to de- 
velop a multidisciplinary combustion simula- 
tion capability that will provide detailed anal- 
yses during the design process of combustors 
for gas turbine engines. NCC will enable the 
analysis of a full combustor from compressor 
exit to turbine inlet. Such a system is critical 
for optimizing the combustor design process. 

The primary flow solver for NCC is 
CORSAIR-CCD. This is a Navier-Stokes flow 
solver based on an explicit four stage Runge- 
Kutta scheme. The original code (CORSAIR)
was developed by Pratt & Whitney¹ and was
designed from the beginning to use
unstructured grids and parallel processing.
The code has since been upgraded by NASA
Glenn with new models (chemistry, spray,
turbulence) and enhanced parallel
processing².

The Numerical Propulsion System Simulation
(NPSS) project at NASA Glenn has
supported the NCC flow solver performance
enhancement effort. An NPSS milestone to
use NCC to run a large scale, fully reacting
combustor simulation within an overnight
turnaround time of 15 hours was met in
September 1998. The effort to meet this
milestone, along with subsequent performance
improvements, is described in the Performance
Improvements section.

Parallel Implementation

The memory requirements of large scale, fully
reacting simulations of real world combustors
were one of the primary motivators for
originally parallelizing the CORSAIR code.
Problems of this size and complexity would not
fit within the memory of traditional
supercomputers. Running CORSAIR in parallel
across a cluster of workstations allowed
solving problems which could not be solved on
traditional supercomputers.

Developing a parallel application requires 
addressing a number of issues including do- 
main decomposition, process organization, 
message passing requirements and portabil- 
ity. 

Domain Decomposition 

The large memory requirements for the simu- 
lation of real world combustors dictated that 
a parallel code be developed using domain
decomposition. A distributed memory, message
passing model was used. The current domain
decomposition method for NCC is relatively 
simple. It is based on the number of compu- 
tational elements in the simulation geometry 
and on the number of available processors. 
During a pre-processing stage the cells of the 
unstructured grid are re-ordered to run con- 
secutively along the longest axis of the grid. 
The number of cells is then evenly divided 
among the number of available processors. 
The last processor takes on any ‘extra’ cells 
if the division between processes is not even. 
These extra cells are typically not a
significant factor in the overall load balance.
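
The decomposition steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, each cell is represented by an (x, y, z) centroid, and "re-ordering along the longest axis" is taken to mean sorting cells by their centroid coordinate on that axis. The paper does not give the actual pre-processing code.

```python
def longest_axis(centroids):
    """Index (0=x, 1=y, 2=z) of the axis with the largest extent."""
    extents = [
        max(c[axis] for c in centroids) - min(c[axis] for c in centroids)
        for axis in range(3)
    ]
    return extents.index(max(extents))


def decompose(centroids, num_procs):
    """Return one list of cell indices per process.

    Cells are ordered along the longest axis, divided evenly, and any
    remainder is assigned to the last process.
    """
    axis = longest_axis(centroids)
    order = sorted(range(len(centroids)), key=lambda i: centroids[i][axis])
    base = len(order) // num_procs
    parts = [order[p * base:(p + 1) * base] for p in range(num_procs)]
    parts[-1].extend(order[num_procs * base:])  # the 'extra' cells
    return parts
```

For the 443,926 element test case on 64 processes, this rule gives 6,936 cells per process with 22 extra cells on the last process, which illustrates why the remainder has little effect on load balance.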

This “on-the-fly” domain decomposition
allows the user to select the number of proces- 
sors used by the simulation at startup, based 
on processor availability. The load is well bal- 
anced across all processors rather than being 
statically determined during grid generation. 
However, no effort is made to minimize the 
size of messages exchanged between processes 
by minimizing the number of cells along the 
process interface boundaries. This will be ad- 
dressed in the future. 


Process Organization 

A Single Program Multiple Data (SPMD) 
strategy was used with CORSAIR-CCD. All 
processes are computational processes con- 
sisting of a copy of CORSAIR-CCD operating 
on its own local domain and exchanging in- 
formation with neighboring processes which 
share common cell faces.

Depending on the geometry, each process 
typically communicates with at most two 
neighboring processes. As the number of
processors increases and the domain is more
finely divided, however, the number of
communication partners per process can increase.

Message Passing Requirements 

The CORSAIR-CCD code solves 19 partial 
differential equations in the benchmark con- 
figuration used in this study. The chemistry 
is modelled by a 12-species, 10-step reduced 
kinetics mechanism. The message passing re- 
quired to handle these computational require- 
ments in the original code was significant. 
Depending on the given simulation, each
process exchanged as many as 563 messages with
a neighboring process each iteration. The
amount of message passing in the code has
been reduced significantly through various
code enhancements which will be described
in the Performance Improvements section.

Portability 

NCC is intended to be used by a wide audi- 
ence and therefore should be portable to a va- 
riety of platforms and parallel environments. 
A message interface layer was created to
allow the use of either MPI or PVM message
passing libraries with no modification to the
code. The code can also be compiled to run 
in serial mode if desired. 

Performance metrics were added to the
message interface layer, which allows the
computation of process and message passing
statistics as desired. These metrics assist
with tuning the CORSAIR-CCD code to a 
particular architecture. 

A makefile structure was designed to sim- 
plify building CORSAIR-CCD on various 
platforms using either message passing
library. The platforms on which CORSAIR-CCD
has run include the SGI Origin 2000,
IBM SP-2 and HP Exemplar, along with
clusters of workstations (IBM, SGI and SUN) and
PCs (LINUX). CORSAIR-CCD has also been 
ported to the Windows NT environment. The 
NT version of the code is maintained sepa- 
rately from the UNIX based version and has 
not been used extensively. 


Performance Measurement 

Benchmark Test Cases 

The test case used for this performance evalu- 
ation was the Lean Direct Injection/Multiple 
Venturi Swirler (LDI-MVS) combustor, which
is a full 3-D planar periodic sector rig³.
This combustor has been proposed for use 
in the evaluation of advanced low emission 
combustor concepts. The computational do- 
main consists of 443,926 tetrahedral elements. 
The problem size of interest for the NPSS
overnight turnaround milestone, however, is
approximately 1.3 million elements. Until a
geometry of this size becomes available the 
execution time of the smaller 444k element 
test case is being scaled up by a factor of three 
to estimate the performance of the larger 
problem. Recently a 971,054 tetrahedral ele- 
ment version of the LDI-MVS combustor ge- 
ometry became available. The results for this 
test case are being scaled by a factor of 1.34 to 
also estimate the execution time of the larger 
(1.3 million element) problem. 

A 12 species, 10 step reduced kinetics 
mechanism is being used to account for the 
amount of computational resources required 
for the chemistry simulation. Unless other- 
wise stated, all turbulence, species and en- 
thalpy equations are turned on during bench- 
marking. Convergence for a fully reacting 
solution is estimated to require 10000 itera- 
tions. 

Benchmark Hardware Platforms 

The CORSAIR-CCD parallel enhancement 
effort initially began in 1995 using an IBM 
SP-2. This machine consisted of 144 IBM 
RS6000/590 processors interconnected by a 
high speed switch. All performance metrics
were recorded in a dedicated environment
where processor usage was restricted to one
user job per node. Network activity on the 
SP-2 may have competed with other active 
users. However, the effect appeared to be 
minimal as benchmark results were highly re- 
peatable from one run to the next. IBM’s 
version of MPI, which was tuned to use the 
IBM SP-2 high speed switch, was used for 
benchmarks on this platform. 

In 1998 an SGI Origin 2000 replaced the 
IBM SP-2 as the benchmark hardware for 
this performance evaluation. The SGI Origin 
2000 is a cache coherent non-uniform memory
access (ccNUMA) architecture. The machine
appears to the user as a shared memory
machine; however, memory is physically
distributed among the processing nodes. Initial
benchmark results on this platform were
averaged over multiple runs in a lightly loaded
environment. Recent benchmarks on this
platform were run on a dedicated 64 processor
system to ensure repeatability of the bench- 
mark measurements. SGI’s version of MPI 
was used for all benchmarks on this platform. 

Metrics 

The metric of most interest in this perfor- 
mance improvement effort has been the time 
required to reach a solution for a 1.3 million 
element problem. This “estimated time to 
solution” is calculated by timing the main it- 
eration loop of CORSAIR-CCD after one full 
iteration has completed. This allows systems 
which load code incrementally from disk to 
complete an entire cycle before benchmark 
timing begins. The main iteration loop is 
timed for a fixed number of iterations so 
an average time per iteration can be calcu- 
lated. The number of iterations timed varies 
depending on the size of the problem and 
the number of processors used. In recent
benchmarks on the SGI Origin 2000, typically
200 iterations are timed. The time required
to complete 10000 iterations for the 1.3
million element problem is then estimated by
multiplying the average time per iteration by
10000 and by the appropriate scaling factor for
the test case. The goal of this effort was to
achieve a 15 hour estimated time to solution 
for a 1.3 million element problem. 
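
The "estimated time to solution" metric described above can be written out as a short Python sketch. The function names and the `step` callable are hypothetical and stand in for one pass of the main iteration loop. The scale factors (3 for the 444k element case, 1.34 for the 971k element case) and the 10000 iteration count are the ones stated in the text.

```python
import time


def time_per_iteration(step, timed_iterations=200):
    """Average wall-clock seconds per call of `step`.

    One untimed call is made first so that any start-up cost is
    excluded, mirroring the procedure described in the text.
    """
    step()  # untimed first iteration
    start = time.perf_counter()
    for _ in range(timed_iterations):
        step()
    return (time.perf_counter() - start) / timed_iterations


def estimated_hours_to_solution(seconds_per_iteration, scale_factor,
                                iterations=10000):
    """Scale a measured time per iteration up to the 1.3 million
    element problem and convert the total to hours."""
    return seconds_per_iteration * iterations * scale_factor / 3600.0
```

With the figures reported later in the paper, 61.4 seconds per iteration on the 444k element case gives roughly 512 hours, and 2.25 seconds per iteration on the 971k element case gives roughly 8.4 hours.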

The initialization and termination sections
of the code are excluded from benchmark
timing, since these sections consume very
little time relative to the time required to
reach a solution. Their time would nevertheless
have been enough to skew the results of a 200
iteration benchmark.

Parallel speedup and efficiency are calcu- 
lated to indicate how well CORSAIR-CCD 
uses the available parallel resources. The par- 
allel speedup metric is calculated by taking 
the ratio of the time per iteration for the se- 
rial case vs. the time per iteration for the 
parallel case. The parallel efficiency is the ra- 
tio of the parallel speedup to the number of 
processors used in the calculation. A desire 
to keep the parallel efficiency above 80% de- 
termined the maximum number of processors 
used at any point during this performance 
improvement effort. These metrics may in- 
dicate potential for further performance im- 
provement. 

The benchmark test cases were too large 
to run within a single node of the IBM SP- 
2. Since speedup and parallel efficiency could 
not be calculated traditionally, an estimated 
speedup was calculated using the time per it- 
eration for the code running on the smallest 
possible number of processors. Scaling up to
this processor count was assumed to be linear,
so this time was multiplied by the number of
processors to extrapolate the single processor
time per iteration, from which the speedup
curve could be estimated.
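
The speedup, efficiency and estimated speedup definitions above amount to the following arithmetic, shown here as a minimal Python sketch. The function names are hypothetical. The estimated speedup assumes, as the text does, that scaling is linear up to the smallest processor count that can hold the problem.

```python
def speedup(serial_time, parallel_time):
    """Ratio of serial to parallel time per iteration."""
    return serial_time / parallel_time


def efficiency(speedup_value, num_procs):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup_value / num_procs


def estimated_speedup(parallel_time, smallest_time, smallest_procs):
    """Speedup relative to an extrapolated single processor time.

    The single processor time is not measured; it is taken to be the
    time on the smallest feasible processor count multiplied by that
    count (the linear scaling assumption).
    """
    extrapolated_serial_time = smallest_time * smallest_procs
    return speedup(extrapolated_serial_time, parallel_time)
```

For example, the reported speedup of 46.0 on 56 processors corresponds to an efficiency of about 82%, which is just above the 80% threshold used to cap the processor count.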


Performance Improvements 


Efforts to achieve a 15 hour turnaround with 
CORSAIR-CCD focused on the steady state 
problem only. The baseline performance of 
the original code is described below, along
with the code enhancements and hardware 
upgrades listed in roughly chronological order 
which have significantly contributed to the 
improved performance of CORSAIR-CCD. 

Baseline Performance 

CORSAIR-CCD was initially ported to an 
IBM SP-2. The 444k element benchmark 
test case consumed 61.4 seconds per itera- 
tion when running on 64 processors of the 
IBM SP-2. It was estimated that a solution
for a 1.3 million element simulation could be
reached within 512 hours using this baseline
code. This was the starting point for the
performance improvement effort.

Figure 1: Performance improvement for the
LDI-MVS 444k element test case due to
algorithm modifications.

Algorithm Modifications 

CORSAIR-CCD uses a four stage Runge-Kutta
algorithm. Originally the convective,
viscous and artificial dissipation terms were 
computed at each stage. This algorithm was 
modified so that the viscous and artificial dis- 
sipation terms are now computed at the first 
stage and then held constant for the remain- 
ing stages. The convective terms continue to 
be computed at every stage. This modifica- 
tion eliminated substantial computation and 
cut the required message passing in half, from 
as many as 563 messages being exchanged 
between communication partners per itera- 
tion to at most 286 messages exchanged per 
iteration. Arrays being exchanged between 
processes were being transmitted as individ- 
ual messages. Some of these arrays were 
packed together into fewer, larger messages, 
further reducing the number of messages
exchanged per iteration to 190. The estimated
time to reach a solution for a 1.3 million
element problem using 64 processors was
reduced from 512 hours to 299 hours (Figure 1).
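
The modified time stepping can be sketched in Python as follows. This is an illustration, not the CORSAIR-CCD implementation: the stage coefficients (1/4, 1/3, 1/2, 1) are a common choice for this family of schemes and are an assumption here, as are the function names and the sign convention du/dt = -(convective + dissipative). The point shown is that the viscous and artificial dissipation terms are evaluated once per step, while the convective terms are evaluated at every stage.

```python
def rk4_step(u, dt, convective, dissipative,
             alphas=(0.25, 1.0 / 3.0, 0.5, 1.0)):
    """One explicit four stage step with frozen dissipation.

    `convective(u)` and `dissipative(u)` return residual
    contributions; the update solves du/dt = -(convective + dissipative).
    """
    frozen = dissipative(u)  # evaluated at the first stage only
    u_stage = u
    for alpha in alphas:
        residual = convective(u_stage) + frozen  # convective: every stage
        u_stage = u - alpha * dt * residual
    return u_stage
```

Because the dissipative terms are computed once instead of four times, both their cost and the interface data exchanged to compute them are incurred once per iteration, which is consistent with the roughly halved message count reported above.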

The 444k element benchmark test case was 
too large to run within a single node of the 
IBM SP-2 so an estimated speedup curve 
was calculated based on the eight processor 
time per iteration (Figure 2). An estimated
speedup appears to be reasonable in this
situation since it was observed that the time
required per iteration for the 16 processor
simulation was half the time required for the 8
processor simulation. The 16 processor
simulation scaled well enough to assume the
eight processor result was near linear and could
be used to extrapolate a single processor time
per iteration.

Figure 2: Estimated speedup for the LDI-MVS
444k element test case following algorithm
modifications on the IBM SP-2.

The best case solution following the algo- 
rithm modifications used 84 processors. The 
estimated time to reach a solution for a 1.3 
million element problem was 238 hours, with 
an estimated speedup of 74.8 and an esti- 
mated parallel efficiency of 89%. 

Code Streamlining 

The gprof profiling tool was used on the IBM 
SP-2 to determine which routines were
consuming the most time during a typical
reacting flow simulation. It was determined that
two finite rate chemistry subroutines called 
four times per iteration for each
computational element required 54% of the code’s
execution time. In one of these routines a
statement which executed an exponentiation
operation (a**0.25) was replaced by nested
square root intrinsics (sqrt(sqrt(a))), yielding
a 21% improvement in performance when using 84
processors.

These two chemistry subroutines were orig- 
inally written to operate over a variable 
number of computational elements. Since 
CORSAIR-CCD called these routines once 
per element, the indexing of temporary
variables was unnecessary and could be
eliminated. This resulted in a 10% performance
savings. It is likely that the compiler could 
not optimize these routines properly due to 
the elaborate indexing and temporary vari- 
ables in the original routine. 

Some calculations which were constant for 
all iterations were relocated to the initial- 
ization section of the code rather than be- 
ing recomputed on each call to these subrou- 
tines. This resulted in a 13.4% improvement 
in performance. Finally, several division op- 
erations were replaced by their multiplicative 
inverse, resulting in another 8.6% improve- 
ment in performance. 
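
The three kinds of rewrite described in this section can be illustrated with a small Python sketch. The function names are hypothetical and the actual chemistry routines are Fortran, so this shows only the algebraic equivalence of the transformations, not their performance effect. Note that multiplying by a precomputed inverse can differ from a true division in the last bit of the result.

```python
import math


def quarter_power_pow(a):
    """Original form: general exponentiation."""
    return a ** 0.25


def quarter_power_sqrt(a):
    """Rewritten form: two nested square roots."""
    return math.sqrt(math.sqrt(a))


def make_scaler(divisor):
    """Hoist a constant out of the per-call path.

    The inverse is computed once, at set-up time, and each call then
    multiplies by it instead of dividing.
    """
    inverse = 1.0 / divisor  # computed once, not on every call

    def scale(x):
        return x * inverse  # multiplication replaces division

    return scale
```

A caller would build the scaler during initialization, for example `scale = make_scaler(4.0)`, and use `scale(x)` inside the iteration loop.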

With these modifications the 444k element 
test case required 14.8 seconds per iteration 
when using 84 processors. The estimated 
time to solution for a 1.3 million element 
problem was 123 hours. Following these mod- 
ifications gprof indicated that the two modi- 
fied chemistry subroutines still consumed ap- 
proximately 32% of the code’s execution time 
on the IBM SP-2, indicating that the finite rate
chemistry routines continued to be dominant. 

Deadlock Elimination 

The original communication scheme in 
CORSAIR-CCD involved all processes send- 
ing to and then receiving from their neigh- 
bors. When CORSAIR-CCD was ported to 
the IBM SP-2, deadlock problems were en- 
countered with this scheme due to limited 
message buffering capability on that plat- 
form. To resolve this situation an odd/even 
communication scheme was implemented. 
This scheme assumed all processes could be 
mapped to a ring topology, with each pro- 
cess communicating with at most two neigh- 
bors. All even processes performed send and 
then receive operations, while all odd pro- 
cesses performed corresponding receive and 
then send operations. This solution resolved
the deadlock initially; however, the scheme
fails when the process topology becomes more
complex. This failure was encountered with the
444k element benchmark
test case when the number of processors in- 
creased past 84. In this situation the num- 
ber of communication partners for some pro- 
cesses increased from two to three. A dead- 
lock condition resulted once again because 
the communication pattern could no longer 
be mapped to a ring topology. 
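
A minimal Python sketch can show why the odd/even ordering depends on the topology. The names are hypothetical, processes are identified by integer ranks, and sends are assumed to block until the partner posts a matching receive (the limited buffering case described above). Under that assumption, a pair of partners that both start with the same operation cannot make progress.

```python
def exchange_order(rank):
    """Operation order used by a process in the odd/even scheme."""
    return ("send", "recv") if rank % 2 == 0 else ("recv", "send")


def conflicting_pairs(edges):
    """Partner pairs whose two processes start with the same operation."""
    return [
        (a, b) for a, b in edges
        if exchange_order(a)[0] == exchange_order(b)[0]
    ]
```

For a chain of ranks such as 0-1-2-3, every pair of partners has opposite parity and `conflicting_pairs` returns an empty list. Once three processes all communicate with one another, for example edges (0, 1), (1, 2) and (0, 2), at least two of them share a parity and the pair (0, 2) conflicts.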

Figure 3: Example graph with three
communication stages.


To resolve this deadlock situation, a graph
coloring algorithm⁴ was implemented to
determine a deadlock free communication
pattern between an arbitrary configuration of
processes. A graph represents the communi- 
cation requirements between processes. The 
processes are represented by vertices in the 
graph. An edge between two vertices
indicates that the corresponding processes
exchange messages. The edges are “colored”
so that no two edges meeting at the same
vertex share a “color”. The communication
pattern is dictated by the resulting colors.
The number of colors represents the number of
communication stages. In the example in Figure
3 there are three communication stages. In
each communication stage, every pair of
processes joined by an edge of that stage’s
color exchanges messages.
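
The edge coloring idea can be sketched with a simple greedy algorithm in Python. This is an assumption for illustration: the paper cites a graph coloring algorithm but does not specify it, and a greedy pass does not always find the minimum number of colors. The function names are hypothetical. Each color becomes one communication stage, and within a stage no process appears in more than one pair.

```python
def color_edges(edges):
    """Greedy edge coloring.

    Each edge gets the smallest color not already used by another
    edge at either of its two endpoints.
    """
    used = {}      # vertex -> set of colors on its edges
    coloring = {}  # edge -> color
    for a, b in edges:
        taken = used.setdefault(a, set()) | used.setdefault(b, set())
        color = 0
        while color in taken:
            color += 1
        coloring[(a, b)] = color
        used[a].add(color)
        used[b].add(color)
    return coloring


def communication_stages(coloring):
    """Group process pairs by color; each group is one stage."""
    count = max(coloring.values()) + 1 if coloring else 0
    stages = [[] for _ in range(count)]
    for edge, color in coloring.items():
        stages[color].append(edge)
    return stages
```

For three mutually communicating processes, `communication_stages(color_edges([(0, 1), (1, 2), (0, 2)]))` yields three stages with one pair each, while a four process ring needs only two stages with two pairs exchanging concurrently in each.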

This new algorithm eliminated the dead- 
lock problem and allowed increasing the num- 
ber of processors used on the IBM SP-2 from 
84 to 96. It was discovered that the new 
communication pattern was slightly more ef- 
ficient because it took advantage of the un- 
derlying parallelism of the IBM SP-2 inter- 
connection network. The performance when 
using the new communication pattern with 
84 processors improved by 3%. The 444k
element test case required 13.0 seconds per
iteration when using 96 processors. Figure 4
illustrates the estimated speedup curve for
CORSAIR-CCD on the IBM SP-2. The estimated
speedup using 96 processors was 80.4 with an
estimated parallel efficiency of 84%. It was
estimated to require 108 hours to reach a
solution for the 1.3 million element problem.

Figure 4: Final estimated speedup on the
IBM SP-2.

Hardware Upgrade 

CORSAIR-CCD was ported to an SGI Origin 
2000 with 195 MHz CPUs in 1998, replacing 
the IBM SP-2 as the benchmark hardware 
platform. Initially 32 processors were used 
for benchmarking on the SGI Origin. When 
comparing the 32 processor performance of 
the 444k element test case between the IBM 
and SGI platforms, the Origin 2000 proved 
to be a factor of 3.4 faster than the IBM SP-2.
Of more interest, however, is the best case
performance of the 444k element test case on 
each platform. The best case performance 
on the IBM SP-2 used 96 processors and re- 
quired 13.0 seconds per iteration. The initial 
best case performance on the Origin used 32 
processors and required 10.1 seconds per iter- 
ation. Therefore a 1.3x improvement in per- 
formance for the 444k element test case was 
realized by switching from the IBM SP-2 to 
the Origin 2000 platform. The Origin 2000 
processors were later upgraded to 250 MHz,
resulting in an additional 1.1x improvement.

ILDM Kinetics Module 

A new Intrinsic Low Dimensional Manifold
(ILDM) Kinetics module⁵,⁶ was integrated
into CORSAIR-CCD to be used in place 
of the existing finite rate chemistry mod- 
ule. The original chemistry module required 
solving 12 species equations for the current 
benchmark configuration while the ILDM
Kinetics module requires solving only two. 
The remaining species are obtained from an 
ILDM table which is generated during a pre- 
processing stage. The reduction of species 
does not give any appreciable reduction in 
the fidelity of the results. The advantage to 
this technique is that computation is signif- 
icantly reduced. The CORSAIR-CCD code 
solved 19 equations using the original finite 
rate chemistry module whereas only nine are 
solved when using the ILDM Kinetics module.
Slightly more memory is required (10
MB/process) to store the ILDM tables;
however, this can be offset completely by
eliminating storage for the 10 species which no longer
need to be calculated. 

Properties such as density, viscosity and 
temperature can also be obtained from the 
ILDM tables, further reducing the amount of 
computation required. Also for some cases 
the enthalpy calculations can be turned off, 
reducing the number of equations to be solved 
to seven. The enthalpy calculations could be 
turned off for the benchmark test case. 

When using this module, message passing 
costs are completely eliminated for 10 species, 
along with their associated derivative and 
flux terms. Also if enthalpy can be turned 
off the message passing of enthalpy variables 
can be eliminated. For the benchmark test 
case the number of messages exchanged be- 
tween communication partners dropped from 
190 to 58 per iteration due to the addition of 
the ILDM Kinetics module. 

Initial benchmarks on the SGI Origin 2000 
indicate that using the ILDM Kinetics module
in place of the original finite rate chemistry 
module resulted in a 4.8x improvement in 
performance for the benchmark test case. 
The 444k element test case required 10.1 sec- 
onds per iteration when using 32 processors 
with the original finite rate chemistry mod- 
ule. The corresponding test case using the 
ILDM kinetics module required 2.1 seconds 
per iteration. With the ILDM kinetics mod- 
ule the time required to reach a solution for 
a 1.3 million element problem was estimated 
at 18 hours. 

FORTRAN I/O Library 

Performance began to taper off on the SGI
Origin 2000 when increasing the number of
processors from 32 to 48, with the parallel
efficiency dropping to 80%. Performance
dropped off even more significantly above
56 processors. Scaling improved by switching
from SGI’s f77 compiler to the f90 compiler
(Figure 5). It was discovered that the
initialization time, where all processes read
the same geometry file, was cut considerably
when using f90. This pointed to the I/O
library as the source of the improved
performance. During the main iteration loop all
processes printed their residual to the
standard output file. This had been added to
monitor the progress of the solution and
inadvertently had not been removed for
benchmarking. The f90 I/O library handled this
much more efficiently than the f77 I/O library,
and a 9.4% improvement in performance was
realized when using f90 with 56 processors.
Once this extraneous I/O was eliminated, the
f77 performance for the main iteration loop
matched the f90 performance.

Figure 5: Performance differences for the
444k element test case due to the I/O library
on the Origin 2000.


Performance Summary 

Current benchmarks with the 444k element 
test case on the Origin 2000 using 56 proces- 
sors require 1.08 seconds per iteration. This 
includes performance improvements due to 
the use of new compiler optimization flags as 
well as a recent change to the ILDM kinet- 
ics module algorithm which resulted in a 6% 
improvement in performance. The packing
of multiple arrays into fewer, larger messages
was also eliminated, increasing the number of
messages exchanged between processes from
58 to 127. The smaller message sizes were
originally believed to improve performance on
the Origin; however, recent benchmarks
indicate this has little effect. Further
investigation is required. It is now estimated to
require 9 hours to reach a solution for a 1.3
million element problem. The speedup is 46.0
and the parallel efficiency is 82%. Figure 6
illustrates the speedup curve for the current
code.

Figure 6: Speedup for the LDI-MVS 444k
element test case on the SGI Origin 2000.
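
The packing of several arrays into one message, which was introduced earlier and later removed, can be illustrated with a short Python sketch. The function names and buffer layout are hypothetical and no message passing library is involved. The sketch only shows the bookkeeping: one contiguous buffer plus the lengths needed to split it apart again on the receiving side.

```python
from array import array


def pack(arrays):
    """Concatenate several float sequences into one buffer.

    Returns the individual lengths together with the buffer so the
    receiver can recover the original arrays.
    """
    lengths = [len(a) for a in arrays]
    buffer = array("d")
    for a in arrays:
        buffer.extend(a)
    return lengths, buffer


def unpack(lengths, buffer):
    """Split a packed buffer back into the original arrays."""
    arrays, start = [], 0
    for n in lengths:
        arrays.append(list(buffer[start:start + n]))
        start += n
    return arrays
```

Packing trades extra copies on both sides for fewer message start-ups, so whether it helps depends on the platform, which is consistent with the mixed results reported for the IBM SP-2 and the Origin 2000.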

The performance results for the 971k element
benchmark test case support the
conclusion that the 1.3 million element problem could
be solved in less than 15 hours to meet the 
NPSS overnight turnaround milestone. Us- 
ing 56 processors on the Origin 2000 the 971k 
element test case requires 2.25 seconds per it- 
eration. It is estimated to require 8.4 hours 
to reach a solution for a 1.3 million element 
problem. The speedup is 45.5 and the par- 
allel efficiency is 81%. The speedup curve is 
illustrated in Figure 7. It was noted that the
speedup and efficiency for the larger 971k
element test case are slightly lower than for
the 444k element test case. This is due to the
larger messages exchanged with the 971k
element test case.

The overall results since the CORSAIR- 
CCD performance improvement effort be- 
gan in 1995 are illustrated in Figure 8. At 
times some modifications to improve numeri- 
cal accuracy have negatively impacted perfor- 
mance. The original CORSAIR-CCD code 
exchanged 563 messages per iteration
between communication partners. Algorithm
modifications and the ILDM Kinetics module
are primarily responsible for the reduction
in message passing by a factor of 4.4 to 127
messages per iteration. Attempts to further
reduce message passing are on-going.

Figure 7: Speedup for the LDI-MVS 971k
element test case on the SGI Origin 2000.

Figure 8: Overview of performance
improvement since 1995.

NASA/TM—2000-209801


Future Work 


Current effort is focused on further reduc- 
ing the time to reach a solution to a large 
scale, fully reacting combustion simulation to 
three hours. A more sophisticated domain de- 
composition algorithm is being investigated. 
This has become a more critical issue as the 
amount of computational work has decreased 
with the use of the ILDM Kinetics module. In 
addition the mixing of message passing and 
shared memory programming via OpenMP 
will be investigated to enable the efficient use
of more processors. MPI will continue to
be used for the existing domain level, coarse 
grained parallelism, and OpenMP will be in- 
vestigated for use with loop level parallelism. 


Concluding Remarks 

The performance of the NCC flow solver, 
CORSAIR-CCD, has been enhanced signif- 
icantly over the past several years. Per- 
formance enhancements have included algo- 
rithm modifications, streamlining of compu- 
tationally intensive code, restructuring the 
communication pattern to eliminate dead- 
lock, and the addition of the ILDM Kinetics 
module which greatly reduced the computa- 
tional requirements of the code. Additional 
improvements can be attributed to hardware 
upgrades over the past few years. It was es- 
timated that the baseline code would require 
more than 500 hours to reach a solution for 
a 1.3 million element problem in 1995. The 
current code is estimated to achieve a solu- 
tion to the same problem within 9 hours. 